Time Range: 2015-2019
Features: 'Country', 'Year', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual'
Number of Countries: Varies from year to year, with around 150 countries in each report
Missing Values: Features such as 'Trust (Government Corruption)' and 'Generosity' have missing values for certain countries in certain years; these were handled during data pre-processing.
Time Range: Matches the Happiness Report, i.e., 2015-2019
Features: 'Access to clean fuels and technologies for cooking (% of population)', 'Access to electricity (% of population)', 'Agriculture, forestry, and fishing, value added (% of GDP)', 'Imports of goods and services (% of GDP)', 'Industry (including construction), value added (% of GDP)', 'Manufacturing, value added (% of GDP)', 'Military expenditure (% of GDP)', 'Tax revenue (% of GDP)'
Number of Countries: Data is available for most countries globally, but only countries present in the Happiness Report were kept in the final merged dataset
Missing Values: Some missing values were observed; these were handled during data pre-processing.
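The scale of the missing-value problem in both sources can be gauged before any imputation with a per-column NaN count; a minimal sketch, using a toy frame in place of the real files:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one year of the Happiness Report
df = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Trust (Government Corruption)': [0.4, np.nan, 0.1],
    'Generosity': [np.nan, np.nan, 0.3],
})

# Missing values per column, worst offenders first
missing = df.isna().sum().sort_values(ascending=False)
print(missing)
```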
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Loop through the years 2015-2019
for year in range(2015, 2020):
    # Read the CSV file from the corresponding folder
    filepath = f'datasets/{year}/{year}.csv'
    df = pd.read_csv(filepath)

    # Add a new column 'year' and fill it with the corresponding year
    df['year'] = year

    # Display the first few rows of the dataframe
    print(f"Year {year}:")
    print(df.head())
    print("\n")
Year 2015:

| | Country | Region | Happiness Rank | Happiness Score | Standard Error | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 | 2015 |
| 1 | Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 | 2015 |
| 2 | Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 | 2015 |
| 3 | Norway | Western Europe | 4 | 7.522 | 0.03880 | 1.45900 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 | 2015 |
| 4 | Canada | North America | 5 | 7.427 | 0.03553 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 | 2015 |
Year 2016:

| | Country | Region | Happiness Rank | Happiness Score | Lower Confidence Interval | Upper Confidence Interval | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Denmark | Western Europe | 1 | 7.526 | 7.460 | 7.592 | 1.44178 | 1.16374 | 0.79504 | 0.57941 | 0.44453 | 0.36171 | 2.73939 | 2016 |
| 1 | Switzerland | Western Europe | 2 | 7.509 | 7.428 | 7.590 | 1.52733 | 1.14524 | 0.86303 | 0.58557 | 0.41203 | 0.28083 | 2.69463 | 2016 |
| 2 | Iceland | Western Europe | 3 | 7.501 | 7.333 | 7.669 | 1.42666 | 1.18326 | 0.86733 | 0.56624 | 0.14975 | 0.47678 | 2.83137 | 2016 |
| 3 | Norway | Western Europe | 4 | 7.498 | 7.421 | 7.575 | 1.57744 | 1.12690 | 0.79579 | 0.59609 | 0.35776 | 0.37895 | 2.66465 | 2016 |
| 4 | Finland | Western Europe | 5 | 7.413 | 7.351 | 7.475 | 1.40598 | 1.13464 | 0.81091 | 0.57104 | 0.41004 | 0.25492 | 2.82596 | 2016 |
Year 2017:

| | Country | Happiness.Rank | Happiness.Score | Whisker.high | Whisker.low | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Generosity | Trust..Government.Corruption. | Dystopia.Residual | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Norway | 1 | 7.537 | 7.594445 | 7.479556 | 1.616463 | 1.533524 | 0.796667 | 0.635423 | 0.362012 | 0.315964 | 2.277027 | 2017 |
| 1 | Denmark | 2 | 7.522 | 7.581728 | 7.462272 | 1.482383 | 1.551122 | 0.792566 | 0.626007 | 0.355280 | 0.400770 | 2.313707 | 2017 |
| 2 | Iceland | 3 | 7.504 | 7.622030 | 7.385970 | 1.480633 | 1.610574 | 0.833552 | 0.627163 | 0.475540 | 0.153527 | 2.322715 | 2017 |
| 3 | Switzerland | 4 | 7.494 | 7.561772 | 7.426227 | 1.564980 | 1.516912 | 0.858131 | 0.620071 | 0.290549 | 0.367007 | 2.276716 | 2017 |
| 4 | Finland | 5 | 7.469 | 7.527542 | 7.410458 | 1.443572 | 1.540247 | 0.809158 | 0.617951 | 0.245483 | 0.382612 | 2.430182 | 2017 |
Year 2018:

| | Overall rank | Country or region | Score | GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finland | 7.632 | 1.305 | 1.592 | 0.874 | 0.681 | 0.202 | 0.393 | 2018 |
| 1 | 2 | Norway | 7.594 | 1.456 | 1.582 | 0.861 | 0.686 | 0.286 | 0.340 | 2018 |
| 2 | 3 | Denmark | 7.555 | 1.351 | 1.590 | 0.868 | 0.683 | 0.284 | 0.408 | 2018 |
| 3 | 4 | Iceland | 7.495 | 1.343 | 1.644 | 0.914 | 0.677 | 0.353 | 0.138 | 2018 |
| 4 | 5 | Switzerland | 7.487 | 1.420 | 1.549 | 0.927 | 0.660 | 0.256 | 0.357 | 2018 |
Year 2019:

| | Overall rank | Country or region | Score | GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finland | 7.769 | 1.340 | 1.587 | 0.986 | 0.596 | 0.153 | 0.393 | 2019 |
| 1 | 2 | Denmark | 7.600 | 1.383 | 1.573 | 0.996 | 0.592 | 0.252 | 0.410 | 2019 |
| 2 | 3 | Norway | 7.554 | 1.488 | 1.582 | 1.028 | 0.603 | 0.271 | 0.341 | 2019 |
| 3 | 4 | Iceland | 7.494 | 1.380 | 1.624 | 1.026 | 0.591 | 0.354 | 0.118 | 2019 |
| 4 | 5 | Netherlands | 7.488 | 1.396 | 1.522 | 0.999 | 0.557 | 0.322 | 0.298 | 2019 |
def standardize_columns(df):
    """Rename year-specific column names to the single standard schema."""
    column_mapping = {
        # 2017 file (dot-separated names)
        'Happiness.Rank': 'Happiness Rank',
        'Happiness.Score': 'Happiness Score',
        'Economy..GDP.per.Capita.': 'Economy (GDP per Capita)',
        'Health..Life.Expectancy.': 'Health (Life Expectancy)',
        'Trust..Government.Corruption.': 'Trust (Government Corruption)',
        'Dystopia.Residual': 'Dystopia Residual',
        # 2018/2019 files (short names)
        'Overall rank': 'Happiness Rank',
        'Country or region': 'Country',
        'Score': 'Happiness Score',
        'GDP per capita': 'Economy (GDP per Capita)',
        'Social support': 'Family',
        'Healthy life expectancy': 'Health (Life Expectancy)',
        'Freedom to make life choices': 'Freedom',
        'Perceptions of corruption': 'Trust (Government Corruption)',
    }
    df = df.rename(columns=column_mapping)
    return df
def fill_missing_columns(df, columns):
    """Add any expected columns that are absent, filled with NaN."""
    for column in columns:
        if column not in df.columns:
            df[column] = float('nan')
    return df
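Applied to a 2018/2019-style file, the rename-then-backfill pattern used by `standardize_columns` and `fill_missing_columns` behaves as below; a self-contained sketch that inlines a slice of the column mapping:

```python
import pandas as pd

# A 2018/2019-style frame with the short column names
df = pd.DataFrame({'Country or region': ['Finland'], 'Score': [7.632]})

# Rename to the standard schema, then backfill absent columns with NaN
df = df.rename(columns={'Country or region': 'Country', 'Score': 'Happiness Score'})
for column in ['Country', 'Happiness Score', 'Dystopia Residual']:
    if column not in df.columns:
        df[column] = float('nan')

print(df.columns.tolist())  # ['Country', 'Happiness Score', 'Dystopia Residual']
```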
def save_df_to_csv(folder_path, df):
    """Save a dataframe to <folder_path>/<df.name>.csv (assumes df.name is set)."""
    os.makedirs(folder_path, exist_ok=True)
    file_name = f"{df.name}.csv"
    file_path = os.path.join(folder_path, file_name)
    df.to_csv(file_path, index=False)
# Define the standard columns
standard_columns = [
    'Country', 'year', 'Happiness Rank', 'Happiness Score',
    'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
    'Freedom', 'Trust (Government Corruption)', 'Generosity',
    'Dystopia Residual'
]

# Initialize an empty list to store the dataframes
dataframes = []

# Loop through the years 2015-2019
for year in range(2015, 2020):
    # Read the CSV file from the corresponding folder
    filepath = f'datasets/{year}/{year}.csv'
    df = pd.read_csv(filepath)

    # Add a new column 'year' and fill it with the corresponding year
    df['year'] = year

    # Standardize the column names
    df = standardize_columns(df)

    # Fill in any missing columns with NaN
    df = fill_missing_columns(df, standard_columns)

    # Select only the standard columns, in a fixed order
    df = df[standard_columns]

    # Append the dataframe to the list of dataframes
    dataframes.append(df)

combined_df = pd.concat(dataframes, ignore_index=True)
combined_df.name = 'combined_df'  # Set the name of the DataFrame
save_df_to_csv('combined_df', combined_df)
combined_df
| | Country | year | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | 2015 | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| 1 | Iceland | 2015 | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| 2 | Denmark | 2015 | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |
| 3 | Norway | 2015 | 4 | 7.522 | 1.45900 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 |
| 4 | Canada | 2015 | 5 | 7.427 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 777 | Rwanda | 2019 | 152 | 3.334 | 0.35900 | 0.71100 | 0.61400 | 0.55500 | 0.41100 | 0.21700 | NaN |
| 778 | Tanzania | 2019 | 153 | 3.231 | 0.47600 | 0.88500 | 0.49900 | 0.41700 | 0.14700 | 0.27600 | NaN |
| 779 | Afghanistan | 2019 | 154 | 3.203 | 0.35000 | 0.51700 | 0.36100 | 0.00000 | 0.02500 | 0.15800 | NaN |
| 780 | Central African Republic | 2019 | 155 | 3.083 | 0.02600 | 0.00000 | 0.10500 | 0.22500 | 0.03500 | 0.23500 | NaN |
| 781 | South Sudan | 2019 | 156 | 2.853 | 0.30600 | 0.57500 | 0.29500 | 0.01000 | 0.09100 | 0.20200 | NaN |
782 rows × 11 columns
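One quick consistency check on the combined frame is the number of distinct countries per year, which should hover around 150; sketched here on a toy stand-in for `combined_df`:

```python
import pandas as pd

# Toy stand-in for combined_df: two years, overlapping countries
combined = pd.DataFrame({
    'Country': ['A', 'B', 'A', 'C'],
    'year': [2015, 2015, 2016, 2016],
})

# Distinct countries per report year
counts = combined.groupby('year')['Country'].nunique()
print(counts)
```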
# Read the Israel economic data CSV file
eco_path = 'datasets/Eco_data.csv'
israel_eco_data = pd.read_csv(eco_path)
israel_eco_data
| | Country Name | Country Code | Series Name | Series Code | 2015 [YR2015] | 2016 [YR2016] | 2017 [YR2017] | 2018 [YR2018] | 2019 [YR2019] |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Israel | ISR | Access to electricity (% of population) | EG.ELC.ACCS.ZS | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| 1 | Israel | ISR | Access to clean fuels and technologies for coo... | EG.CFT.ACCS.ZS | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| 2 | Israel | ISR | Agriculture, forestry, and fishing, value adde... | NV.AGR.TOTL.ZS | 1.418493 | 1.448635 | 1.431974 | 1.385569 | 1.338553 |
| 3 | Israel | ISR | Imports of goods and services (% of GDP) | NE.IMP.GNFS.ZS | 28.041051 | 28.089027 | 27.318128 | 29.021044 | 27.047940 |
| 4 | Israel | ISR | Industry (including construction), value added... | NV.IND.TOTL.ZS | 19.822742 | 18.973514 | 18.386093 | 18.613306 | 18.634297 |
| 5 | Israel | ISR | Manufacturing, value added (% of GDP) | NV.IND.MANF.ZS | 12.754977 | 11.905269 | 11.320437 | 11.492926 | 11.394308 |
| 6 | Israel | ISR | Military expenditure (% of GDP) | MS.MIL.XPND.GD.ZS | 5.489119 | 5.467031 | 5.475427 | 5.325956 | 5.110442 |
| 7 | Israel | ISR | Tax revenue (% of GDP) | GC.TAX.TOTL.GD.ZS | 23.065472 | 23.056862 | 24.240975 | 22.694092 | 22.109376 |
# Drop unnecessary columns
israel_eco_data = israel_eco_data.drop(['Country Code', 'Series Code'], axis=1)
# Reshape the dataframe using the melt function
israel_eco_data_melted = pd.melt(israel_eco_data, id_vars=['Country Name', 'Series Name'], var_name='year', value_name='value')
# Pivot the dataframe to have 'Series Name' values as columns
israel_eco_data_pivoted = israel_eco_data_melted.pivot_table(index=['Country Name', 'year'], columns='Series Name', values='value').reset_index()
# Rename the 'Country Name' column to 'Country' for consistency
israel_eco_data_pivoted = israel_eco_data_pivoted.rename(columns={'Country Name': 'Country'})
# Extract the numeric year (e.g. '2015 [YR2015]' -> 2015) and convert to integer
israel_eco_data_pivoted['year'] = israel_eco_data_pivoted['year'].str.extract(r'(\d+)', expand=False).astype(int)
israel_eco_data_pivoted
| Series Name | Country | year | Access to clean fuels and technologies for cooking (% of population) | Access to electricity (% of population) | Agriculture, forestry, and fishing, value added (% of GDP) | Imports of goods and services (% of GDP) | Industry (including construction), value added (% of GDP) | Manufacturing, value added (% of GDP) | Military expenditure (% of GDP) | Tax revenue (% of GDP) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Israel | 2015 | 100.0 | 100.0 | 1.418493 | 28.041051 | 19.822742 | 12.754977 | 5.489119 | 23.065472 |
| 1 | Israel | 2016 | 100.0 | 100.0 | 1.448635 | 28.089027 | 18.973514 | 11.905269 | 5.467031 | 23.056862 |
| 2 | Israel | 2017 | 100.0 | 100.0 | 1.431974 | 27.318128 | 18.386093 | 11.320437 | 5.475427 | 24.240975 |
| 3 | Israel | 2018 | 100.0 | 100.0 | 1.385569 | 29.021044 | 18.613306 | 11.492926 | 5.325956 | 22.694092 |
| 4 | Israel | 2019 | 100.0 | 100.0 | 1.338553 | 27.047940 | 18.634297 | 11.394308 | 5.110442 | 22.109376 |
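The melt-then-pivot reshape above is the standard way to tidy the World Bank's wide layout (one row per country/series, one column per year); the same round trip in miniature:

```python
import pandas as pd

# World-Bank-style wide layout
wide = pd.DataFrame({
    'Country Name': ['Israel', 'Israel'],
    'Series Name': ['Tax revenue (% of GDP)', 'Military expenditure (% of GDP)'],
    '2015 [YR2015]': [23.07, 5.49],
    '2016 [YR2016]': [23.06, 5.47],
})

# melt: year columns -> one row per (country, series, year)
long = pd.melt(wide, id_vars=['Country Name', 'Series Name'],
               var_name='year', value_name='value')
long['year'] = long['year'].str.extract(r'(\d+)', expand=False).astype(int)

# pivot_table: series names become columns, one row per country-year
tidy = long.pivot_table(index=['Country Name', 'year'],
                        columns='Series Name', values='value').reset_index()
print(tidy)
```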
# Filter the WHR data for Israel
israel_whr_data = combined_df[combined_df['Country'] == 'Israel']
# Merge the WHR data and the new economic data for Israel
israel_data_merged = pd.merge(israel_whr_data, israel_eco_data_pivoted, on=['Country', 'year'])
israel_data_merged.name = 'israel_data_merged' # Set the name of the DataFrame
save_df_to_csv('israel_data_merged', israel_data_merged)
israel_data_merged
| | Country | year | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Access to clean fuels and technologies for cooking (% of population) | Access to electricity (% of population) | Agriculture, forestry, and fishing, value added (% of GDP) | Imports of goods and services (% of GDP) | Industry (including construction), value added (% of GDP) | Manufacturing, value added (% of GDP) | Military expenditure (% of GDP) | Tax revenue (% of GDP) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Israel | 2015 | 11 | 7.278 | 1.228570 | 1.22393 | 0.913870 | 0.413190 | 0.077850 | 0.331720 | 3.088540 | 100.0 | 100.0 | 1.418493 | 28.041051 | 19.822742 | 12.754977 | 5.489119 | 23.065472 |
| 1 | Israel | 2016 | 11 | 7.267 | 1.337660 | 0.99537 | 0.849170 | 0.364320 | 0.087280 | 0.322880 | 3.310290 | 100.0 | 100.0 | 1.448635 | 28.089027 | 18.973514 | 11.905269 | 5.467031 | 23.056862 |
| 2 | Israel | 2017 | 11 | 7.213 | 1.375382 | 1.37629 | 0.838404 | 0.405989 | 0.085242 | 0.330083 | 2.801757 | 100.0 | 100.0 | 1.431974 | 27.318128 | 18.386093 | 11.320437 | 5.475427 | 24.240975 |
| 3 | Israel | 2018 | 19 | 6.814 | 1.301000 | 1.55900 | 0.883000 | 0.533000 | 0.272000 | 0.354000 | NaN | 100.0 | 100.0 | 1.385569 | 29.021044 | 18.613306 | 11.492926 | 5.325956 | 22.694092 |
| 4 | Israel | 2019 | 13 | 7.139 | 1.276000 | 1.45500 | 1.029000 | 0.371000 | 0.082000 | 0.261000 | NaN | 100.0 | 100.0 | 1.338553 | 27.047940 | 18.634297 | 11.394308 | 5.110442 | 22.109376 |
# Read the world economic data CSV file
world_eco_path = 'datasets/world_economy.csv'
new_eco_data = pd.read_csv(world_eco_path)
new_eco_data
| | Country Name | Country Code | Series Name | Series Code | 2015 [YR2015] | 2016 [YR2016] | 2017 [YR2017] | 2018 [YR2018] | 2019 [YR2019] |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Israel | ISR | Access to electricity (% of population) | EG.ELC.ACCS.ZS | 100 | 100 | 100 | 100 | 100 |
| 1 | Israel | ISR | Access to clean fuels and technologies for coo... | EG.CFT.ACCS.ZS | 100 | 100 | 100 | 100 | 100 |
| 2 | Israel | ISR | Agriculture, forestry, and fishing, value adde... | NV.AGR.TOTL.ZS | 1.418492836 | 1.448634907 | 1.431974316 | 1.385569111 | 1.338553005 |
| 3 | Israel | ISR | Imports of goods and services (% of GDP) | NE.IMP.GNFS.ZS | 28.04105109 | 28.08902666 | 27.31812806 | 29.02104382 | 27.04794045 |
| 4 | Israel | ISR | Industry (including construction), value added... | NV.IND.TOTL.ZS | 19.82274245 | 18.97351387 | 18.38609253 | 18.61330645 | 18.63429673 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2123 | World | WLD | Imports of goods and services (% of GDP) | NE.IMP.GNFS.ZS | 27.59273887 | 26.62679695 | 27.56612193 | 28.53691444 | 27.77673399 |
| 2124 | World | WLD | Industry (including construction), value added... | NV.IND.TOTL.ZS | 26.83202447 | 26.28263514 | 26.75704068 | 27.22290626 | 26.69576164 |
| 2125 | World | WLD | Manufacturing, value added (% of GDP) | NV.IND.MANF.ZS | 16.39695325 | 16.20163734 | 16.26174169 | 16.38936987 | 15.98546945 |
| 2126 | World | WLD | Military expenditure (% of GDP) | MS.MIL.XPND.GD.ZS | 2.26043519 | 2.218343398 | 2.173431495 | 2.151094186 | 2.192591439 |
| 2127 | World | WLD | Tax revenue (% of GDP) | GC.TAX.TOTL.GD.ZS | 13.97879502 | 13.86900834 | 14.20141468 | 13.84647578 | 13.7989771 |
2128 rows × 9 columns
# Drop unnecessary columns
new_eco_data = new_eco_data.drop(['Country Code', 'Series Code'], axis=1)
# Reshape the dataframe using the melt function
new_eco_data_melted = pd.melt(new_eco_data, id_vars=['Country Name', 'Series Name'], var_name='year', value_name='value')
# Extract the numeric year (e.g. '2015 [YR2015]' -> 2015) and convert to integer
new_eco_data_melted['year'] = new_eco_data_melted['year'].str.extract(r'(\d+)', expand=False).astype(int)
# Convert the value column to numeric, coercing non-numeric values to NaN
new_eco_data_melted['value'] = pd.to_numeric(new_eco_data_melted['value'], errors='coerce')
# Pivot the dataframe to have 'Series Name' values as columns
new_eco_data_pivoted = new_eco_data_melted.pivot_table(index=['Country Name', 'year'], columns='Series Name', values='value', aggfunc='mean').reset_index()
# Rename the 'Country Name' column to 'Country' for consistency
new_eco_data_pivoted = new_eco_data_pivoted.rename(columns={'Country Name': 'Country'})
new_eco_data_pivoted
| Series Name | Country | year | Access to clean fuels and technologies for cooking (% of population) | Access to electricity (% of population) | Agriculture, forestry, and fishing, value added (% of GDP) | Imports of goods and services (% of GDP) | Industry (including construction), value added (% of GDP) | Manufacturing, value added (% of GDP) | Military expenditure (% of GDP) | Tax revenue (% of GDP) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | 27.4 | 71.500000 | 20.634323 | NaN | 22.124042 | 11.420006 | 0.994576 | 7.585382 |
| 1 | Afghanistan | 2016 | 28.6 | 97.699997 | 25.740314 | NaN | 10.466808 | 4.114197 | 0.956772 | 9.502653 |
| 2 | Afghanistan | 2017 | 29.7 | 97.699997 | 26.420199 | NaN | 10.051874 | 3.530422 | 0.945227 | 9.898451 |
| 3 | Afghanistan | 2018 | 30.9 | 96.616135 | 22.042897 | NaN | 13.387247 | 6.160177 | 1.006746 | NaN |
| 4 | Afghanistan | 2019 | 31.9 | 97.699997 | 25.773971 | NaN | 14.058112 | 7.043181 | 1.118231 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1320 | Zimbabwe | 2015 | 29.5 | 33.700001 | 8.284247 | 37.588635 | 22.358392 | 11.888599 | 1.886876 | 17.673363 |
| 1321 | Zimbabwe | 2016 | 29.8 | 42.561729 | 7.873986 | 31.275493 | 22.115059 | 11.596020 | 1.720991 | 15.458341 |
| 1322 | Zimbabwe | 2017 | 29.8 | 44.178635 | 8.340969 | 30.370807 | 21.404999 | 11.017009 | 0.387831 | 15.874375 |
| 1323 | Zimbabwe | 2018 | 29.9 | 45.572647 | 7.319375 | 28.386297 | 31.037898 | 13.678137 | 0.309323 | 7.214765 |
| 1324 | Zimbabwe | 2019 | 30.1 | 46.781475 | 9.819262 | 25.524111 | 32.025947 | 14.222360 | 0.534804 | NaN |
1325 rows × 10 columns
# Find the countries that exist in both datasets
common_countries = set(combined_df['Country']).intersection(set(new_eco_data_pivoted['Country']))
# Filter the WHR data and the new economic data for the common countries
whr_data_common = combined_df[combined_df['Country'].isin(common_countries)]
new_eco_data_common = new_eco_data_pivoted[new_eco_data_pivoted['Country'].isin(common_countries)]
# Merge the WHR data and the new economic data for the common countries
merged_data = pd.merge(whr_data_common, new_eco_data_common, on=['Country', 'year'])
merged_data
| | Country | year | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Access to clean fuels and technologies for cooking (% of population) | Access to electricity (% of population) | Agriculture, forestry, and fishing, value added (% of GDP) | Imports of goods and services (% of GDP) | Industry (including construction), value added (% of GDP) | Manufacturing, value added (% of GDP) | Military expenditure (% of GDP) | Tax revenue (% of GDP) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | 2015 | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 | 100.0 | 100.000000 | 0.626446 | 53.309526 | 24.201126 | 17.118658 | 0.643891 | 9.556179 |
| 1 | Iceland | 2015 | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 | 100.0 | 100.000000 | 5.280670 | 44.166486 | 20.195136 | 10.667453 | NaN | 22.689444 |
| 2 | Denmark | 2015 | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 | 100.0 | 100.000000 | 0.957580 | 48.630054 | 19.987455 | 12.403445 | 1.111446 | 33.921619 |
| 3 | Norway | 2015 | 4 | 7.522 | 1.45900 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 | 100.0 | 100.000000 | 1.537558 | 32.057350 | 31.010379 | 6.873817 | 1.507280 | 22.223551 |
| 4 | Canada | 2015 | 5 | 7.427 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 | 100.0 | 100.000000 | 1.869836 | 34.314941 | 24.415683 | 9.975423 | 1.152709 | 12.389811 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 672 | Rwanda | 2019 | 152 | 3.334 | 0.35900 | 0.71100 | 0.61400 | 0.55500 | 0.41100 | 0.21700 | NaN | 1.9 | 40.368607 | 23.545797 | 36.125688 | 18.864151 | 8.366595 | 1.239645 | 14.595448 |
| 673 | Tanzania | 2019 | 153 | 3.231 | 0.47600 | 0.88500 | 0.49900 | 0.41700 | 0.14700 | 0.27600 | NaN | 4.3 | 37.659687 | 26.546415 | 16.951259 | 28.620195 | 8.486499 | 1.017767 | NaN |
| 674 | Afghanistan | 2019 | 154 | 3.203 | 0.35000 | 0.51700 | 0.36100 | 0.00000 | 0.02500 | 0.15800 | NaN | 31.9 | 97.699997 | 25.773971 | NaN | 14.058112 | 7.043181 | 1.118231 | NaN |
| 675 | Central African Republic | 2019 | 155 | 3.083 | 0.02600 | 0.00000 | 0.10500 | 0.22500 | 0.03500 | 0.23500 | NaN | 0.7 | 14.300000 | 28.341832 | 34.310420 | 20.511971 | 17.782321 | 1.915529 | 8.318458 |
| 676 | South Sudan | 2019 | 156 | 2.853 | 0.30600 | 0.57500 | 0.29500 | 0.01000 | 0.09100 | 0.20200 | NaN | 0.0 | 6.707007 | NaN | NaN | NaN | NaN | 3.558185 | NaN |
677 rows × 19 columns
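The drop from 782 WHR rows to 677 merged rows follows from `pd.merge` defaulting to an inner join: a country-year absent from either side is discarded. In miniature:

```python
import pandas as pd

whr = pd.DataFrame({'Country': ['A', 'B'], 'year': [2015, 2015],
                    'Happiness Score': [7.0, 6.5]})
eco = pd.DataFrame({'Country': ['A'], 'year': [2015],
                    'Tax revenue (% of GDP)': [20.0]})

# Inner join on country-year: country 'B' has no economic data and is dropped
merged = pd.merge(whr, eco, on=['Country', 'year'])
print(len(merged))  # 1
```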
# Calculate the correlation matrix over the numeric columns only
corr_matrix = merged_data.corr(numeric_only=True)
# Plot the heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5, fmt='.2f')
plt.title("Correlation Heatmap")
plt.show()
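Beyond the full heatmap, the single column of correlations against the target is often what matters; a self-contained sketch of pulling it out (toy numbers, not the real data):

```python
import pandas as pd

# Toy numeric frame standing in for merged_data's numeric columns
df = pd.DataFrame({
    'Happiness Score': [7.5, 6.0, 4.0, 3.0],
    'Access to electricity (% of population)': [100.0, 95.0, 60.0, 40.0],
    'Military expenditure (% of GDP)': [1.0, 2.0, 1.5, 3.0],
})

# Correlations with the target, strongest first
corr = df.corr(numeric_only=True)['Happiness Score'].drop('Happiness Score')
print(corr.sort_values(ascending=False))
```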
# Define the list of economic features you're interested in
economic_features = [
'Access to clean fuels and technologies for cooking (% of population)',
'Access to electricity (% of population)',
'Agriculture, forestry, and fishing, value added (% of GDP)',
'Imports of goods and services (% of GDP)',
'Industry (including construction), value added (% of GDP)',
'Manufacturing, value added (% of GDP)',
'Military expenditure (% of GDP)',
'Tax revenue (% of GDP)'
]
# Determine the number of rows for the subplot grid
n_rows = int(np.ceil(len(economic_features) / 2))
# Initialize the subplot grid
fig, axes = plt.subplots(n_rows, 2, figsize=(15, n_rows * 5))
# Flatten the axes array for easier indexing
axes = axes.flatten()
# Loop through the list and create a scatter plot for each feature
for i, feature in enumerate(economic_features):
    ax = axes[i]
    ax.scatter(merged_data[feature], merged_data['Happiness Score'])
    ax.set_xlabel(feature)
    ax.set_ylabel('Happiness Score')
    ax.set_title(f'Happiness Score vs {feature}')

# If the number of features is odd, remove the last (empty) subplot
if len(economic_features) % 2 != 0:
    fig.delaxes(axes[-1])

plt.tight_layout()
plt.show()
# Set a threshold for the percentage of missing values you are willing to accept
threshold = 0.7
# Drop columns with more missing values than the threshold
missing_column_ratio = merged_data.isnull().mean()
merged_data = merged_data.loc[:, missing_column_ratio <= threshold]
# Drop rows with more missing values than the threshold
missing_row_ratio = merged_data.isnull().mean(axis=1)
merged_data = merged_data.loc[missing_row_ratio <= threshold, :]
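The same ratio-based dropping, isolated on a toy frame so the threshold's effect is visible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'keep': [1.0, 2.0, 3.0, 4.0],
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],  # 75% missing
})
threshold = 0.7

# Columns whose missing-value ratio exceeds 70% are dropped
df = df.loc[:, df.isnull().mean() <= threshold]
print(df.columns.tolist())  # ['keep']
```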
# Impute missing values using SimpleImputer from scikit-learn
from sklearn.impute import SimpleImputer
# Cast 'year' to string so it is treated as categorical (and later one-hot encoded)
merged_data['year'] = merged_data['year'].astype(str)
# Use the mean imputation strategy for numeric columns
numeric_columns = merged_data.select_dtypes(include=['number']).columns
mean_imputer = SimpleImputer(strategy='mean')
merged_data[numeric_columns] = mean_imputer.fit_transform(merged_data[numeric_columns])
# Use the most frequent value (mode) imputation strategy for categorical columns
categorical_columns = merged_data.select_dtypes(include=['object']).columns
mode_imputer = SimpleImputer(strategy='most_frequent')
merged_data[categorical_columns] = mode_imputer.fit_transform(merged_data[categorical_columns])
merged_data
| | Country | year | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Access to clean fuels and technologies for cooking (% of population) | Access to electricity (% of population) | Agriculture, forestry, and fishing, value added (% of GDP) | Imports of goods and services (% of GDP) | Industry (including construction), value added (% of GDP) | Manufacturing, value added (% of GDP) | Military expenditure (% of GDP) | Tax revenue (% of GDP) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | 2015 | 1.0 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.517380 | 100.0 | 100.000000 | 0.626446 | 53.309526 | 24.201126 | 17.118658 | 0.643891 | 9.556179 |
| 1 | Iceland | 2015 | 2.0 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.702010 | 100.0 | 100.000000 | 5.280670 | 44.166486 | 20.195136 | 10.667453 | 1.828786 | 22.689444 |
| 2 | Denmark | 2015 | 3.0 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.492040 | 100.0 | 100.000000 | 0.957580 | 48.630054 | 19.987455 | 12.403445 | 1.111446 | 33.921619 |
| 3 | Norway | 2015 | 4.0 | 7.522 | 1.45900 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.465310 | 100.0 | 100.000000 | 1.537558 | 32.057350 | 31.010379 | 6.873817 | 1.507280 | 22.223551 |
| 4 | Canada | 2015 | 5.0 | 7.427 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.451760 | 100.0 | 100.000000 | 1.869836 | 34.314941 | 24.415683 | 9.975423 | 1.152709 | 12.389811 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 672 | Rwanda | 2019 | 152.0 | 3.334 | 0.35900 | 0.71100 | 0.61400 | 0.55500 | 0.41100 | 0.21700 | 2.111105 | 1.9 | 40.368607 | 23.545797 | 36.125688 | 18.864151 | 8.366595 | 1.239645 | 14.595448 |
| 673 | Tanzania | 2019 | 153.0 | 3.231 | 0.47600 | 0.88500 | 0.49900 | 0.41700 | 0.14700 | 0.27600 | 2.111105 | 4.3 | 37.659687 | 26.546415 | 16.951259 | 28.620195 | 8.486499 | 1.017767 | 16.629946 |
| 674 | Afghanistan | 2019 | 154.0 | 3.203 | 0.35000 | 0.51700 | 0.36100 | 0.00000 | 0.02500 | 0.15800 | 2.111105 | 31.9 | 97.699997 | 25.773971 | 44.190469 | 14.058112 | 7.043181 | 1.118231 | 16.629946 |
| 675 | Central African Republic | 2019 | 155.0 | 3.083 | 0.02600 | 0.00000 | 0.10500 | 0.22500 | 0.03500 | 0.23500 | 2.111105 | 0.7 | 14.300000 | 28.341832 | 34.310420 | 20.511971 | 17.782321 | 1.915529 | 8.318458 |
| 676 | South Sudan | 2019 | 156.0 | 2.853 | 0.30600 | 0.57500 | 0.29500 | 0.01000 | 0.09100 | 0.20200 | 2.111105 | 0.0 | 6.707007 | 10.659440 | 44.190469 | 26.156929 | 12.556008 | 3.558185 | 16.629946 |
677 rows × 19 columns
from sklearn.preprocessing import MinMaxScaler
# Initialize the MinMaxScaler
scaler = MinMaxScaler()
# Fit and transform the numeric columns
merged_data[numeric_columns] = scaler.fit_transform(merged_data[numeric_columns])
merged_data
| Country | year | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Access to clean fuels and technologies for cooking (% of population) | Access to electricity (% of population) | Agriculture, forestry, and fishing, value added (% of GDP) | Imports of goods and services (% of GDP) | Industry (including construction), value added (% of GDP) | Manufacturing, value added (% of GDP) | Military expenditure (% of GDP) | Tax revenue (% of GDP) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | 2015 | 0.000000 | 0.964145 | 0.666274 | 0.820870 | 0.825092 | 0.919296 | 0.760595 | 0.354121 | 0.618377 | 1.000 | 1.000000 | 0.009843 | 0.304479 | 0.351903 | 0.328876 | 0.042237 | 0.254065 |
| 1 | Iceland | 2015 | 0.006369 | 0.959023 | 0.621336 | 0.852938 | 0.830710 | 0.868467 | 0.256292 | 0.520598 | 0.671742 | 1.000 | 1.000000 | 0.086670 | 0.251580 | 0.280144 | 0.190658 | 0.131723 | 0.603235 |
| 2 | Denmark | 2015 | 0.012739 | 0.952325 | 0.632385 | 0.827603 | 0.766556 | 0.896934 | 0.876175 | 0.407350 | 0.611053 | 1.000 | 1.000000 | 0.015309 | 0.277405 | 0.276424 | 0.227852 | 0.077548 | 0.901862 |
| 3 | Norway | 2015 | 0.019108 | 0.951340 | 0.696088 | 0.809580 | 0.775819 | 0.925041 | 0.661394 | 0.414032 | 0.603328 | 1.000 | 1.000000 | 0.024883 | 0.181519 | 0.473877 | 0.109379 | 0.107442 | 0.590849 |
| 4 | Canada | 2015 | 0.025478 | 0.932624 | 0.632772 | 0.804507 | 0.793716 | 0.874268 | 0.597144 | 0.546622 | 0.599411 | 1.000 | 1.000000 | 0.030368 | 0.194581 | 0.355746 | 0.175831 | 0.080664 | 0.329402 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 672 | Rwanda | 2019 | 0.961783 | 0.126281 | 0.171279 | 0.432482 | 0.538124 | 0.766575 | 0.744687 | 0.258927 | 0.500950 | 0.019 | 0.377543 | 0.388169 | 0.205058 | 0.256302 | 0.141362 | 0.087230 | 0.388042 |
| 673 | Tanzania | 2019 | 0.968153 | 0.105989 | 0.227099 | 0.538321 | 0.437336 | 0.575967 | 0.266348 | 0.329326 | 0.500950 | 0.043 | 0.349266 | 0.437700 | 0.094119 | 0.431062 | 0.143931 | 0.070473 | 0.442133 |
| 674 | Afghanistan | 2019 | 0.974522 | 0.100473 | 0.166985 | 0.314477 | 0.316389 | 0.000000 | 0.045297 | 0.188527 | 0.500950 | 0.319 | 0.975992 | 0.424949 | 0.251718 | 0.170212 | 0.113008 | 0.078060 | 0.442133 |
| 675 | Central African Republic | 2019 | 0.980892 | 0.076832 | 0.012405 | 0.000000 | 0.092025 | 0.310773 | 0.063416 | 0.280404 | 0.500950 | 0.007 | 0.105428 | 0.467337 | 0.194555 | 0.285820 | 0.343095 | 0.138275 | 0.221158 |
| 676 | South Sudan | 2019 | 0.987261 | 0.031521 | 0.145992 | 0.349757 | 0.258545 | 0.013812 | 0.164882 | 0.241029 | 0.500950 | 0.000 | 0.026169 | 0.175456 | 0.251718 | 0.386937 | 0.231121 | 0.262332 | 0.442133 |
677 rows × 19 columns
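Min-max scaling maps each column independently onto [0, 1] via (x - min) / (max - min), which is why every scaled column above spans exactly that range; a tiny demonstration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two columns on very different scales
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# Each column is rescaled independently to [0, 1]
scaled = MinMaxScaler().fit_transform(X)
print(scaled)
```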
# One-hot encode categorical variables
merged_data = pd.get_dummies(merged_data, columns=categorical_columns, drop_first=True)
merged_data
| | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Access to clean fuels and technologies for cooking (% of population) | ... | Country_United States | Country_Uruguay | Country_Uzbekistan | Country_Vietnam | Country_Zambia | Country_Zimbabwe | year_2016 | year_2017 | year_2018 | year_2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.964145 | 0.666274 | 0.820870 | 0.825092 | 0.919296 | 0.760595 | 0.354121 | 0.618377 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.006369 | 0.959023 | 0.621336 | 0.852938 | 0.830710 | 0.868467 | 0.256292 | 0.520598 | 0.671742 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.012739 | 0.952325 | 0.632385 | 0.827603 | 0.766556 | 0.896934 | 0.876175 | 0.407350 | 0.611053 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.019108 | 0.951340 | 0.696088 | 0.809580 | 0.775819 | 0.925041 | 0.661394 | 0.414032 | 0.603328 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.025478 | 0.932624 | 0.632772 | 0.804507 | 0.793716 | 0.874268 | 0.597144 | 0.546622 | 0.599411 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 672 | 0.961783 | 0.126281 | 0.171279 | 0.432482 | 0.538124 | 0.766575 | 0.744687 | 0.258927 | 0.500950 | 0.019 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 673 | 0.968153 | 0.105989 | 0.227099 | 0.538321 | 0.437336 | 0.575967 | 0.266348 | 0.329326 | 0.500950 | 0.043 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 674 | 0.974522 | 0.100473 | 0.166985 | 0.314477 | 0.316389 | 0.000000 | 0.045297 | 0.188527 | 0.500950 | 0.319 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 675 | 0.980892 | 0.076832 | 0.012405 | 0.000000 | 0.092025 | 0.310773 | 0.063416 | 0.280404 | 0.500950 | 0.007 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 676 | 0.987261 | 0.031521 | 0.145992 | 0.349757 | 0.258545 | 0.013812 | 0.164882 | 0.241029 | 0.500950 | 0.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
677 rows × 162 columns
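The `drop_first=True` argument above drops one dummy column per categorical variable so the remaining indicators are not perfectly collinear (the dropped level becomes the baseline). A minimal toy illustration, using synthetic data rather than the merged dataset:

```python
import pandas as pd

# Toy frame with one categorical column; drop_first=True removes the
# first category (2015), which becomes the implicit baseline level
toy = pd.DataFrame({"year": [2015, 2016, 2017]})
dummies = pd.get_dummies(toy, columns=["year"], drop_first=True)
print(list(dummies.columns))  # no year_2015 column
```

This is why the encoded table above has `year_2016` through `year_2019` but no `year_2015` column: a row with zeros in all four year dummies is a 2015 observation.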
def clean_dataset(dataset):
"""
Cleans a dataset by filling missing values with the mean,
replacing infinite values with NaN, and again filling these
with the mean.
Parameters:
- dataset: The DataFrame to clean.
Returns:
- The cleaned DataFrame.
"""
# Replace missing values with the column mean
dataset = dataset.fillna(dataset.mean())
# Replace infinite values with NaN
dataset = dataset.replace([np.inf, -np.inf], np.nan)
# Again replace any new missing values with the column mean
dataset = dataset.fillna(dataset.mean())
return dataset
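As a quick sanity check, the cleaning routine can be exercised on a tiny synthetic frame (the function is repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def clean_dataset(dataset):
    """Same cleaning steps as above: mean-impute NaN, map ±inf to NaN, mean-impute again."""
    dataset = dataset.fillna(dataset.mean())
    dataset = dataset.replace([np.inf, -np.inf], np.nan)
    return dataset.fillna(dataset.mean())

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.inf, 2.0, 4.0]})
cleaned = clean_dataset(toy)
# NaN in 'a' becomes the column mean of the observed values (2.0);
# inf in 'b' becomes the mean of the finite values (3.0)
print(cleaned)
```

Note that the second `fillna` is what handles column `b`: its infinite entry only becomes NaN after the `replace` step, so a single up-front imputation would miss it.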
def check_infinity_and_nan(dataset):
"""
Checks a DataFrame for any infinity or NaN values and prints the count.
Parameters:
- dataset: The DataFrame to check.
"""
print('Infinity:\n', np.isinf(dataset).sum())
print('\nNaN:\n', dataset.isna().sum())
merged_data = clean_dataset(merged_data)
check_infinity_and_nan(merged_data)
Infinity:
Happiness Rank 0
Happiness Score 0
Economy (GDP per Capita) 0
Family 0
Health (Life Expectancy) 0
..
Country_Zimbabwe 0
year_2016 0
year_2017 0
year_2018 0
year_2019 0
Length: 162, dtype: int64
NaN:
Happiness Rank 0
Happiness Score 0
Economy (GDP per Capita) 0
Family 0
Health (Life Expectancy) 0
..
Country_Zimbabwe 0
year_2016 0
year_2017 0
year_2018 0
year_2019 0
Length: 162, dtype: int64
# Before proceeding with feature engineering or dimensionality reduction,
# make sure to clean the dataset and handle any missing or problematic values.
merged_data = merged_data.fillna(merged_data.mean())
# Replace infinite values with NaN so they can be imputed below
merged_data = merged_data.replace([np.inf, -np.inf], np.nan)
# Impute missing values (NaN) using the mean of the column
merged_data = merged_data.fillna(merged_data.mean())
# Verify that no infinite or missing values remain
print('Infinity:\n', np.isinf(merged_data).sum())
print('\nNaN:\n', merged_data.isna().sum())
Infinity:
Happiness Rank 0
Happiness Score 0
Economy (GDP per Capita) 0
Family 0
Health (Life Expectancy) 0
..
Country_Zimbabwe 0
year_2016 0
year_2017 0
year_2018 0
year_2019 0
Length: 162, dtype: int64
NaN:
Happiness Rank 0
Happiness Score 0
Economy (GDP per Capita) 0
Family 0
Health (Life Expectancy) 0
..
Country_Zimbabwe 0
year_2016 0
year_2017 0
year_2018 0
year_2019 0
Length: 162, dtype: int64
# Create a new feature, 'Economic Growth Rate', as the row-to-row change in GDP per capita
# (note: pct_change runs over the whole stacked frame, so values at the boundary
# between two countries compare different countries)
merged_data['Economic Growth Rate'] = merged_data['Economy (GDP per Capita)'].pct_change()
merged_data = clean_dataset(merged_data)
merged_data
# Create other new features as needed
# merged_data['New Feature'] = ...
| Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Access to clean fuels and technologies for cooking (% of population) | ... | Country_Uruguay | Country_Uzbekistan | Country_Vietnam | Country_Zambia | Country_Zimbabwe | year_2016 | year_2017 | year_2018 | year_2019 | Economic Growth Rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.964145 | 0.666274 | 0.820870 | 0.825092 | 0.919296 | 0.760595 | 0.354121 | 0.618377 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.363053 |
| 1 | 0.006369 | 0.959023 | 0.621336 | 0.852938 | 0.830710 | 0.868467 | 0.256292 | 0.520598 | 0.671742 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.067447 |
| 2 | 0.012739 | 0.952325 | 0.632385 | 0.827603 | 0.766556 | 0.896934 | 0.876175 | 0.407350 | 0.611053 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.017784 |
| 3 | 0.019108 | 0.951340 | 0.696088 | 0.809580 | 0.775819 | 0.925041 | 0.661394 | 0.414032 | 0.603328 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.100733 |
| 4 | 0.025478 | 0.932624 | 0.632772 | 0.804507 | 0.793716 | 0.874268 | 0.597144 | 0.546622 | 0.599411 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.090960 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 672 | 0.961783 | 0.126281 | 0.171279 | 0.432482 | 0.538124 | 0.766575 | 0.744687 | 0.258927 | 0.500950 | 0.019 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.879581 |
| 673 | 0.968153 | 0.105989 | 0.227099 | 0.538321 | 0.437336 | 0.575967 | 0.266348 | 0.329326 | 0.500950 | 0.043 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.325905 |
| 674 | 0.974522 | 0.100473 | 0.166985 | 0.314477 | 0.316389 | 0.000000 | 0.045297 | 0.188527 | 0.500950 | 0.319 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -0.264706 |
| 675 | 0.980892 | 0.076832 | 0.012405 | 0.000000 | 0.092025 | 0.310773 | 0.063416 | 0.280404 | 0.500950 | 0.007 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -0.925714 |
| 676 | 0.987261 | 0.031521 | 0.145992 | 0.349757 | 0.258545 | 0.013812 | 0.164882 | 0.241029 | 0.500950 | 0.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 10.769231 |
677 rows × 163 columns
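One caveat with `pct_change` on the stacked frame: it compares each row with the row above it, so at the boundary between two countries the "growth" mixes different countries — the 10.77 value in the last row above comes from dividing one country's GDP figure by its neighbor's near-zero value. A hedged sketch of a per-country alternative, using a hypothetical long-format frame with `Country` and `year` columns (not the one-hot encoded frame used above):

```python
import pandas as pd

# Hypothetical long-format frame: one row per (country, year)
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "year": [2015, 2016, 2015, 2016],
    "gdp": [1.0, 1.1, 2.0, 1.0],
})
# Growth is computed within each country, so B's first row gets NaN
# instead of being compared with A's last row
df["growth"] = df.groupby("Country")["gdp"].pct_change()
print(df)
```

With this grouping, each country's first year is NaN (no prior observation) rather than an artifact of the previous country's last row.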
from sklearn.decomposition import PCA
# Initialize PCA with the desired number of components
pca = PCA(n_components=10)
# Fit and transform the dataset
# (note: merged_data still contains 'Happiness Score' and 'Happiness Rank' here,
# so the components — and any model trained on them — carry target information)
merged_data_pca = pca.fit_transform(merged_data)
# Check the explained variance ratio of each principal component
explained_variance_ratio = pca.explained_variance_ratio_
# Convert the PCA-transformed data back to a DataFrame
merged_data_pca = pd.DataFrame(merged_data_pca, columns=[f'PC{i+1}' for i in range(pca.n_components_)])
merged_data_pca
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.001604 | -0.897843 | 0.061916 | 0.040008 | 0.032124 | 0.589098 | 0.156907 | 0.357971 | -0.056216 | 0.119451 |
| 1 | -0.433440 | -0.827133 | 0.040206 | 0.026291 | 0.015233 | 0.384663 | -0.178413 | 0.312580 | -0.301090 | -0.060678 |
| 2 | -0.345496 | -0.905308 | 0.068750 | 0.050674 | 0.027472 | 0.653413 | -0.252762 | 0.437998 | 0.205310 | 0.124263 |
| 3 | -0.258872 | -0.879471 | 0.060593 | 0.046677 | 0.031109 | 0.534516 | 0.042242 | 0.353102 | -0.010900 | 0.179446 |
| 4 | -0.455582 | -0.820864 | 0.060193 | 0.049072 | 0.021048 | 0.525421 | 0.102027 | 0.344379 | -0.160712 | 0.079033 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 672 | 0.521800 | 0.916005 | -0.289386 | -0.627831 | -0.573316 | 0.549434 | 0.089255 | 0.026940 | 0.467882 | -0.004951 |
| 673 | -0.032955 | 0.998668 | -0.321348 | -0.630206 | -0.591285 | 0.199880 | 0.107224 | -0.105473 | 0.139383 | -0.044785 |
| 674 | -0.623781 | 0.815838 | -0.214755 | -0.639372 | -0.621989 | -0.475490 | -0.185324 | 0.025796 | 0.006850 | -0.072506 |
| 675 | -1.288647 | 1.480610 | -0.235654 | -0.627348 | -0.599626 | 0.028333 | 0.095488 | -0.015831 | -0.032358 | 0.125122 |
| 676 | 10.408402 | 1.315980 | -0.232931 | -0.678948 | -0.558864 | -0.062395 | -0.133569 | -0.107370 | 0.178773 | 0.420160 |
677 rows × 10 columns
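The `explained_variance_ratio_` reported by PCA can be cross-checked with a plain SVD of the centered data: the component variances are the squared singular values divided by n − 1. A numpy-only sketch on synthetic data (not the pipeline above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Center the data, then take the SVD; squared singular values / (n - 1)
# are the variances of the principal components
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
var = s ** 2 / (X.shape[0] - 1)
ratio = var / var.sum()

# Ratios come out sorted in decreasing order and sum to 1
print(ratio)
```

Summing the first k entries of `explained_variance_ratio_` is the usual way to decide whether `n_components=10` retains enough of the variance.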
from sklearn.model_selection import train_test_split
# Define target variable (e.g., 'Happiness Score') and the feature matrix
X = merged_data_pca
y = merged_data['Happiness Score']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
# Initialize the models
lr = LinearRegression()
dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
xgb = XGBRegressor()
from sklearn.model_selection import cross_val_score
# Calculate the cross-validated mean squared error for each model
lr_mse = -cross_val_score(lr, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()
dt_mse = -cross_val_score(dt, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()
rf_mse = -cross_val_score(rf, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()
xgb_mse = -cross_val_score(xgb, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()
print(f'Linear Regression MSE: {lr_mse}')
print(f'Decision Tree MSE: {dt_mse}')
print(f'Random Forest MSE: {rf_mse}')
print(f'XGBoost MSE: {xgb_mse}')
Linear Regression MSE: 0.0008087020804494518
Decision Tree MSE: 0.005151870187877138
Random Forest MSE: 0.0021875285783169423
XGBoost MSE: 0.002070645120960241
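A note on the sign convention: sklearn scorers follow a "greater is better" rule, so `'neg_mean_squared_error'` returns the negated MSE, and the leading minus sign in the cells above flips it back. The underlying quantity is just the mean of squared residuals:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# A scorer like 'neg_mean_squared_error' would report the negative of this,
# so that a higher score always means a better model
print(mse([3.0, 2.0, 1.0], [2.5, 2.0, 2.0]))  # (0.25 + 0 + 1) / 3
```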
from sklearn.model_selection import GridSearchCV
# Specify the hyperparameters and their possible values
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10]
}
# Initialize a GridSearchCV object for the RandomForestRegressor
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
# Get the best parameters and the corresponding mean squared error
best_params = grid_search.best_params_
best_mse = -grid_search.best_score_
print(f'Best Parameters: {best_params}')
print(f'Best MSE: {best_mse}')
Best Parameters: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 50}
Best MSE: 0.0021813998981147407
param_grid_xgb = {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.1, 1],
'max_depth': [3, 5, 8]
}
grid_search_xgb = GridSearchCV(xgb, param_grid_xgb, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search_xgb.fit(X_train, y_train)
best_params_xgb = grid_search_xgb.best_params_
best_mse_xgb = -grid_search_xgb.best_score_
print(f'XGBoost Best Parameters: {best_params_xgb}')
print(f'XGBoost Best MSE: {best_mse_xgb}')
XGBoost Best Parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}
XGBoost Best MSE: 0.0015920512244952723
param_grid_dt = {
'max_depth': [None, 2, 5, 10],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
grid_search_dt = GridSearchCV(dt, param_grid_dt, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search_dt.fit(X_train, y_train)
best_params_dt = grid_search_dt.best_params_
best_mse_dt = -grid_search_dt.best_score_
print(f'Decision Tree Best Parameters: {best_params_dt}')
print(f'Decision Tree Best MSE: {best_mse_dt}')
Decision Tree Best Parameters: {'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2}
Decision Tree Best MSE: 0.005040350131762375
from sklearn.linear_model import Ridge
param_grid_lr = {
'alpha': [0.1, 0.5, 1.0, 5.0, 10.0]
}
lr = Ridge(random_state=42)
grid_search_lr = GridSearchCV(lr, param_grid_lr, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search_lr.fit(X_train, y_train)
best_params_lr = grid_search_lr.best_params_
best_mse_lr = -grid_search_lr.best_score_
print(f'Ridge Regression Best Parameters: {best_params_lr}')
print(f'Ridge Regression Best MSE: {best_mse_lr}')
Ridge Regression Best Parameters: {'alpha': 0.1}
Ridge Regression Best MSE: 0.0008100595594059633
# Fit the best model to the entire training set
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)
# Evaluate the model on the testing set
test_mse = -cross_val_score(best_model, X_test, y_test, cv=5, scoring='neg_mean_squared_error').mean()
print(f'Test MSE: {test_mse}')
print(f'Best model is:\n{best_model}')
Test MSE: 0.0035905463654408477
Best model is:
RandomForestRegressor(max_depth=10, n_estimators=50)
from sklearn.metrics import r2_score
# Make predictions on the test set
y_pred = best_model.predict(X_test)
# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2}')
R-squared: 0.9604821182126816
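`r2_score` implements R² = 1 − SS_res/SS_tot, the share of the target's variance that the predictions account for. A small numpy sketch of the same computation:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the fraction of variance explained."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```

An R² of 1 means perfect prediction; 0 means the model does no better than predicting the mean.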
# # Train the RandomForestRegressor on the original dataset
# X_original = merged_data.drop(['Country', 'year', 'Happiness Score'], axis=1)
# X_train_original, X_test_original, y_train_original, y_test_original = train_test_split(X_original, y, test_size=0.3, random_state=42)
# best_model_original = RandomForestRegressor(**best_params)
# best_model_original.fit(X_train_original, y_train_original)
# # Get feature importances
# feature_importances = best_model_original.feature_importances_
# # Map feature importances to their corresponding feature names
# importances_dict = dict(zip(X_original.columns, feature_importances))
# # Print feature importances sorted by importance
# sorted_importances = sorted(importances_dict.items(), key=lambda x: x[1], reverse=True)
# print('Feature Importances:')
# for feature, importance in sorted_importances:
# print(f'{feature}: {importance}')
merged_data
| Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Access to clean fuels and technologies for cooking (% of population) | ... | Country_Uruguay | Country_Uzbekistan | Country_Vietnam | Country_Zambia | Country_Zimbabwe | year_2016 | year_2017 | year_2018 | year_2019 | Economic Growth Rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.964145 | 0.666274 | 0.820870 | 0.825092 | 0.919296 | 0.760595 | 0.354121 | 0.618377 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.363053 |
| 1 | 0.006369 | 0.959023 | 0.621336 | 0.852938 | 0.830710 | 0.868467 | 0.256292 | 0.520598 | 0.671742 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.067447 |
| 2 | 0.012739 | 0.952325 | 0.632385 | 0.827603 | 0.766556 | 0.896934 | 0.876175 | 0.407350 | 0.611053 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.017784 |
| 3 | 0.019108 | 0.951340 | 0.696088 | 0.809580 | 0.775819 | 0.925041 | 0.661394 | 0.414032 | 0.603328 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.100733 |
| 4 | 0.025478 | 0.932624 | 0.632772 | 0.804507 | 0.793716 | 0.874268 | 0.597144 | 0.546622 | 0.599411 | 1.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.090960 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 672 | 0.961783 | 0.126281 | 0.171279 | 0.432482 | 0.538124 | 0.766575 | 0.744687 | 0.258927 | 0.500950 | 0.019 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.879581 |
| 673 | 0.968153 | 0.105989 | 0.227099 | 0.538321 | 0.437336 | 0.575967 | 0.266348 | 0.329326 | 0.500950 | 0.043 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.325905 |
| 674 | 0.974522 | 0.100473 | 0.166985 | 0.314477 | 0.316389 | 0.000000 | 0.045297 | 0.188527 | 0.500950 | 0.319 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -0.264706 |
| 675 | 0.980892 | 0.076832 | 0.012405 | 0.000000 | 0.092025 | 0.310773 | 0.063416 | 0.280404 | 0.500950 | 0.007 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -0.925714 |
| 676 | 0.987261 | 0.031521 | 0.145992 | 0.349757 | 0.258545 | 0.013812 | 0.164882 | 0.241029 | 0.500950 | 0.000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 10.769231 |
677 rows × 163 columns
# Train the RandomForestRegressor on the original features, dropping Happiness Score,
# Happiness Rank, Economy (GDP per Capita), and the one-hot Country/year columns
cols_to_drop = [col for col in merged_data.columns if col.startswith('Country_') or col.startswith('year_')]
X_original = merged_data.drop(['Happiness Score', 'Happiness Rank','Economy (GDP per Capita)'] + cols_to_drop, axis=1)
X_train_original, X_test_original, y_train_original, y_test_original = train_test_split(X_original, y, test_size=0.3, random_state=42)
best_model_original = RandomForestRegressor(**best_params)
best_model_original.fit(X_train_original, y_train_original)
RandomForestRegressor(max_depth=10, n_estimators=50)
import plotly.express as px
def plot_importances(importances_dict):
sorted_importances = sorted(importances_dict.items(), key=lambda x: x[1], reverse=True)
features, importances = zip(*sorted_importances)
fig = px.pie(values=importances, names=features, title='Feature Importances for: Random Forest Regressor')
fig.show()
# Get feature importances
feature_importances = best_model_original.feature_importances_
# Map feature importances to their corresponding feature names
importances_dict = dict(zip(X_original.columns, feature_importances))
# Print feature importances sorted by importance
sorted_importances = sorted(importances_dict.items(), key=lambda x: x[1], reverse=True)
print('Feature Importances:')
for feature, importance in sorted_importances:
print(f'{feature}: {importance}')
plot_importances(importances_dict)
Feature Importances:
Access to clean fuels and technologies for cooking (% of population): 0.386278805555829
Health (Life Expectancy): 0.19046643516328177
Access to electricity (% of population): 0.07849470745180914
Dystopia Residual: 0.05032361818603281
Freedom: 0.04710813004971819
Economic Growth Rate: 0.04611105108709464
Trust (Government Corruption): 0.045161726220380255
Agriculture, forestry, and fishing, value added (% of GDP): 0.04355008547786852
Generosity: 0.03727177548619021
Family: 0.026481611453014052
Military expenditure (% of GDP): 0.014167250862454859
Imports of goods and services (% of GDP): 0.010776383816312081
Industry (including construction), value added (% of GDP): 0.00832975056359191
Manufacturing, value added (% of GDP): 0.008215329559466168
Tax revenue (% of GDP): 0.007263339066956401
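Tree-ensemble `feature_importances_` are fractions that sum to 1, so reporting them as percentages just means multiplying by 100. A small sketch with hypothetical values (not the fitted model's actual importances):

```python
# Hypothetical importances: fractions summing to 1, as feature_importances_ returns
importances = {
    "clean fuels": 0.386,
    "life expectancy": 0.190,
    "electricity": 0.078,
    "other": 0.346,
}
assert abs(sum(importances.values()) - 1.0) < 1e-9

# Sort descending and print as percentages
for feature, frac in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {frac * 100:.1f}%")
```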
models = ['Linear Regression', 'Decision Tree', 'Random Forest', 'XGBoost']
mse_scores = {model: mse for model, mse in zip(models, [lr_mse, dt_mse, rf_mse, xgb_mse])}
r2_scores = {} # To be filled in next
models_dict = {model_name: model for model_name, model in zip(models, [lr, dt, rf, xgb])}
for model_name, model in models_dict.items():
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2_scores[model_name] = r2_score(y_test, y_pred)
def print_feature_importances(model, model_name, X_train):
"""
Prints the feature importances for a trained model.
For linear regression, the coefficients are printed instead.
Parameters:
- model: The trained model.
- model_name: The name of the model (for print output).
- X_train: The training data (for feature names).
"""
if model_name == 'Linear Regression':
# Print coefficients for linear regression
coef_dict = dict(zip(X_train.columns, model.coef_))
sorted_coefs = sorted(coef_dict.items(), key=lambda x: x[1], reverse=True)
print(f'\n\nCoefficients for {model_name}:\n\n')
for feature, coef in sorted_coefs:
print(f'{feature}: {coef}')
else:
# Print feature importances for other models
importances_dict = dict(zip(X_train.columns, model.feature_importances_))
sorted_importances = sorted(importances_dict.items(), key=lambda x: x[1], reverse=True)
print(f'\n\nFeature importances for {model_name}:\n\n')
for feature, importance in sorted_importances:
print(f'{feature}: {importance}')
models_dict_original = {model_name: model for model_name, model in zip(models, [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), XGBRegressor()])}
for model_name, model in models_dict_original.items():
model.fit(X_train_original, y_train_original)
print_feature_importances(model, model_name, X_train_original)
Coefficients for Linear Regression:

Dystopia Residual: 0.615722025771025
Family: 0.27968905803885474
Health (Life Expectancy): 0.2500414687858842
Generosity: 0.22485449375906333
Access to clean fuels and technologies for cooking (% of population): 0.205703128515053
Trust (Government Corruption): 0.175419855616718
Freedom: 0.1471106043344794
Economic Growth Rate: -0.007316990402415932
Military expenditure (% of GDP): -0.030271089233538516
Access to electricity (% of population): -0.03041128258316493
Tax revenue (% of GDP): -0.03221993532440113
Industry (including construction), value added (% of GDP): -0.035334700151962134
Imports of goods and services (% of GDP): -0.08041622789312945
Agriculture, forestry, and fishing, value added (% of GDP): -0.08682005598235425
Manufacturing, value added (% of GDP): -0.08788564783656642

Feature importances for Decision Tree:

Access to clean fuels and technologies for cooking (% of population): 0.6300861273190899
Generosity: 0.06214222662779826
Economic Growth Rate: 0.05377375953473972
Dystopia Residual: 0.04472877897951522
Access to electricity (% of population): 0.04428891758391074
Military expenditure (% of GDP): 0.030987715687357302
Trust (Government Corruption): 0.030629004562091205
Agriculture, forestry, and fishing, value added (% of GDP): 0.025844843930774682
Family: 0.020783009337161977
Freedom: 0.019850274180447693
Manufacturing, value added (% of GDP): 0.01575727445132297
Imports of goods and services (% of GDP): 0.009142684456000874
Health (Life Expectancy): 0.006646741134620046
Industry (including construction), value added (% of GDP): 0.0035912568445732925
Tax revenue (% of GDP): 0.0017473853705961646

Feature importances for Random Forest:

Access to clean fuels and technologies for cooking (% of population): 0.41266388131451703
Health (Life Expectancy): 0.14280891602709447
Access to electricity (% of population): 0.10891086102096362
Dystopia Residual: 0.054181373770221815
Trust (Government Corruption): 0.05328009438653899
Economic Growth Rate: 0.044626913590183045
Freedom: 0.04257603252647297
Generosity: 0.03800499691554119
Agriculture, forestry, and fishing, value added (% of GDP): 0.03164657677448716
Family: 0.024854551045915493
Military expenditure (% of GDP): 0.014093274513078487
Imports of goods and services (% of GDP): 0.009429713617274937
Industry (including construction), value added (% of GDP): 0.007950614435742177
Tax revenue (% of GDP): 0.007558132788258923
Manufacturing, value added (% of GDP): 0.007414067273709492

Feature importances for XGBoost:

Access to clean fuels and technologies for cooking (% of population): 0.6372100114822388
Health (Life Expectancy): 0.0824100598692894
Access to electricity (% of population): 0.05310436710715294
Economic Growth Rate: 0.04092632606625557
Dystopia Residual: 0.03346017003059387
Freedom: 0.03142310678958893
Agriculture, forestry, and fishing, value added (% of GDP): 0.030888257548213005
Generosity: 0.026628954336047173
Trust (Government Corruption): 0.021451303735375404
Military expenditure (% of GDP): 0.018731366842985153
Imports of goods and services (% of GDP): 0.008770945481956005
Family: 0.005468444898724556
Manufacturing, value added (% of GDP): 0.00361040816642344
Tax revenue (% of GDP): 0.0032728963997215033
Industry (including construction), value added (% of GDP): 0.0026434429455548525
import matplotlib.pyplot as plt
def plot_feature_importances(model, model_name, X_train):
"""
Plots the feature importances or coefficients for a trained model in a pie chart.
For linear regression, the absolute values of coefficients are plotted instead.
Parameters:
- model: The trained model.
- model_name: The name of the model (for print output).
- X_train: The training data (for feature names).
"""
if model_name == 'Linear Regression':
# Plot absolute coefficients for linear regression
coef_dict = dict(zip(X_train.columns, abs(model.coef_)))
else:
# Plot feature importances for other models
coef_dict = dict(zip(X_train.columns, model.feature_importances_))
# Sorting and taking top 10 features for neat pie chart
sorted_importances = sorted(coef_dict.items(), key=lambda x: x[1], reverse=True)[:10]
features, importances = zip(*sorted_importances)
# Creating pie chart
fig, ax = plt.subplots()
ax.pie(importances, labels=features, autopct='%1.1f%%')
plt.title(f'Feature Importances for {model_name}')
plt.show()
for model_name, model in models_dict.items():
model.fit(X_train, y_train)
plot_feature_importances(model, model_name, X_train)
plt.figure(figsize=(12, 6))
# Plot MSE scores
plt.subplot(1, 2, 1)
plt.bar(mse_scores.keys(), mse_scores.values())
plt.title('MSE Scores')
plt.ylabel('MSE')
plt.xticks(rotation=45)
# Plot R-squared scores
plt.subplot(1, 2, 2)
plt.bar(r2_scores.keys(), r2_scores.values())
plt.title('R-Squared Scores')
plt.ylabel('R-Squared')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The goal of our project was to model and understand the key drivers of happiness across countries using four machine learning methods: Linear Regression, Decision Trees, Random Forest, and XGBoost.
When interpreting the Linear Regression coefficients, remember that each one represents the change in the dependent variable (happiness score) for a one-unit change in that predictor, with all other variables held constant; since the features were min-max scaled, a "unit" here spans a feature's full observed range. For example, a one-unit increase in 'Dystopia Residual' is associated with an increase of approximately 0.62 in happiness score, other predictors held fixed.
The Decision Tree, Random Forest, and XGBoost models, in contrast, report feature importances rather than coefficients. These indicate which variables were most influential in predicting happiness, based on how much they contributed to the splits in the trees.
Model performance was evaluated with Mean Squared Error (MSE) and R-squared. A lower MSE indicates a better fit to the data, while a higher R-squared indicates that the model explains a larger share of the variance in the dependent variable.